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Convolution neural networks (CNN or ConvNet), a deep neural network class 
inspired by biological processes, are immensely used for image classification 
or visual imagery. These networks need various parameters or attributes like 
number of filters, filter size, number of input channels, padding stride and 
dilation, for doing the required task. In this paper, we focused on the 


hyperparameter, i.e., filter size. Filter sizes come in various sizes like 3x3, 


5x5, and 7x7. We varied the filter sizes and recorded their effects on the 
models' accuracy. The models' architecture is kept intact and only the filter 
sizes are varied. This gives a better understanding of the effect of filter sizes 
on image classification. CIFAR10 and FashionMNIST datasets are used for 
this study. Experimental results showed the accuracy is inversely proportional 
to the filter size. The accuracy using 3x3 filters on CIFAR10 and Fashion- 
MNIST is 73.04% and 93.68%, respectively. 
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1. INTRODUCTION 

Artificial intelligence (AI) has minimized the gap between human and machine capabilities. The 
researchers and enthusiasts are doing enormous work on numerous aspects of the field to make extraordinary 
things happen in many domains. The domain of computer vision is one such area. Various recognition or 
classification algorithms were developed like support vector machines (SVM), neural networks (NN), multi- 
level perceptrons (MLP), and many more. A lot of machine learning algorithms have been proposed to achive 
the task of character recognition [1]-[3]. These advancements helped machines to perceive image and video 
recognition, natural language processing, and much more. Deep learning is enormously perfected with time, 
primarily over convolutional neural networks (CNN), and gives much better results than the others [4]-[7]. 
CNN is now a well-established machine learning tool used for image classification problems, especially in 
medical sciences and practical life like detecting road signs, recognizing human activity, and facial 
expression recognition [8], [9]. 

CNN is a deep learning algorithm that takes an image as input and recognizes it. The architecture of 
CNN is based on human neurons. These are feed-forward neural networks that can capture the temporal and 
spatial dependencies by applying relevant filters. The network can extract features without manual handling 
[10]. CNN isused for a wider range of image recognition task like medical image recognition [11], [12], x 
rays recognition [13], [14], handwritten character recognition[15]-[17], and offline character recognition 
[18], [19]. A CNN has the input layer, convolutional layer, a max-pooling layer, and a fully connected layer. 
In the end, we have the SoftMax applied to classify the image to its respective class [20]. The convolutional 
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layer is a 2D layer, and the filters are used in this layer. CNN input layer has three dimensions. The input 
image is a three-dimensional image comprising height, width, and depth. The depth is taken as one for the 
greyscale image and three for the RGB image. The filters are the feature matrixes of various sizes, e.g. 3 X 3, 
5 x 5, and 7 x 7. The filters are represented by the vector of weights, which tries to find whether the features 
are available in the input image. The vector of the weights represents a feature such as a curve, edge, and 
shape. We have considered three types of filters: 3 x 3, 5 x 5, and 7 X 7, and checked the impact of these 
filters on the image classification. 

In the convolutional layer, the output is calculated as a dot product of the weights and the input and 
then adds some bias. The filters slide over the input image matrix and produce the convolutional output. If 
the input is having dimensions of H X W xX 1 and the N is the number of filters applied, then the output of 
the convolutional layer is H Xx W XN. The stride is applied to determine the number of pixels skipped for 
the next convolution. If stride is one, then the filter moves one pixel and calculates the convolutional output. 
If stride is a considerable value, then the convolutional output complexity decreases, but it also affects the 
accuracy. The general rule suggested is to take the stride's value less than twice the filter size [21]-[23]. A 
pooling layer is used for the downsampling and is applied along the spatial dimensions. If the stride is 2, then 
the resulting volume is H — 2X W-—2 XN. The fully connected layer computes the valid class score. In 
this layer, every neuron is connected to every other neuron in the previous layer, and the output size will be 
1x1 XM, where M is the number of output classes, e.g., for the handwritten digit classification M is 10 for 
the cifar10 dataset M is 10 and for cifar100 dataset M is 100. 

The hyperparameter called filter sizes is taken into consideration in this study. A filter is a 2d square 
matrix that is applied to every convolutional layer. These matrixes are of various sizes like 3 x 3, 5 X 5, and 
7 X 7. We varied these filter sizes keeping all the other hyperparameters the same to see their effect on the 
accuracy and to see which ones perform the best among which circumstances. 


2. DATASETS 

We performed the experiments with the following two datasets i) CIFARIO [24] and ii) 
FashionMNIST [25] dataset. These two datasets are small and precise datasets having low computational 
costs. The results drawn from these datasets can be applied to most of the datasets. So, we use these two 
datasets to avoid the computation cost of more extensive datasets. CIFAR1O is the dataset of the 60000 
images of these categories plane, car, bird, cat, deer, dog, frog, horse, and ship. This dataset has 10 output 
classes. The second one is the FashionMNIST dataset that contains 10 different categories, so the output 
classes are 10. All the datasets have 60000 images, out of which we use 40000 for the training and 10000 for 
the validation, and the rest of 1000 for testing the accuracy of the model. The dataset is split into a validation 
dataset to train our model efficiently. The validation dataset ensures the model does not cause overfitting or 
underfitting on training data and performs best on test data. 


3. ARCHITECTURE 

For both the datasets, CIFAR10O as well as MNIST dataset, we build a CNN model with thirteen 
layers. The first layer is the input layer, tracked by two Conv2D layers. In each layer, 32 filters are used. 
Then batch normalization is used for standardizing the inputs followed by the MaxPooling2D layer, a (2, 2) 
pool is applied. The dropout layer is used for protecting the model from overfitting, and then the Flatten layer 
is used to flatten the inputs. Two Dropout layers are used in the whole model, and then at the send, one dense 
layer is used. Two types of activation functions are used. One is ReLU, and at the end, SoftMax is used as an 
activation function. Adam optimizer is employed for optimizing the model with the learning rate of 0.001, 
and the categorical cross-entropy function is used to calculate the error. The same model is used with all 
three different filter sizes, and the effect on the accuracy is measured. The architecture of the models is kept 
unchanged to track the effect of filter sizes on the accuracy and loss of the model. 


4. EXPERIMENTAL RESULTS 

For the experiment, we used a simple convolution neural network with a total of 13 layers and just 
two Conv2D layers. The more layers we use in CNN, the more accuracy we get, but it is not always true the 
model could run into overfitting. Also, the computational cost increases effectively with each layer. We have 
to find a balance between the number of layers we use. The architecture is given in 

Figure /. In the first experiment, both the convolutional layers are trained with a 3 x 3 filter, and in 
each layer, 32 filters are used. In the second experiment, we have used the same architecture the number of 
the filters is kept unchanged, only the filter sizes are changed to 5 X 5, and in the third experiment, the filter 
size was altered to 7X 7, and the rest of the network is the same. 
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Figure 1. Architecture of CNN 


Table 1. Accuracy of the CIFARIO dataset using different filter sizes shows the accuracy of the 
training, validation, and test dataset for CIFAR10 using different filter sizes, and Table 2 shows the loss on 
the CIFAR10 dataset using different filter sizes. The accuracy on the CIFAR1O dataset using a filter size of 
3 x 3 is highest, 1.e., 94.26% on the training dataset, 72.75% validation dataset, and 63.5% on the test 
dataset, and the lowest accuracy is when 7 X 7 filters are used. All the model configurations are trained on 50 
epochs, and the number of filters used is 32. The accuracy and the loss during training and validation are 
shown in Figures 2 and 3, respectively. 


Table 1. Accuracy of the CIFAR10 dataset using Table 2. Loss of the CIFAR10 dataset different filter 


different filter sizes sizes 
Filter Size Training Data___ Validation Data__ Test Data Filter Size Training Data__- Validation Data__ Test Data 
3x3 0.942625 0.7275 0.7304 3x3 0.1659 1.02 1.01 
5x5 0.923275 0.7261 0.7297 5x5 0.2209 0.97 0.97 
7X7 0.87725 0.7067 0.635 7X7 0.3377 1.01 2.97 


It is clear from Figure 2 the accuracy is increasing as the epochs increase, but the rate of increase 
decreases and reaches a point above which the accuracy doesn't change. The accuracy curve of the 3 x 3 
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filter is above the 7 X 7 and 5 X 5 filter. The validation accuracy looks the same as training accuracy, but 
there are many spikes, but lesser filter size curves are higher than the bigger filter sizes. Table 1 shows that 
accuracy is the highest with a 3 x 3 filter on the test data. So, it is concluded that the smaller filter sizes tend 
to give better accuracy than using larger filter sizes. 
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Figure 2. Training accuracy and validation accuracy using different filter sizes on the CIFAR10 dataset 


Figure 3 shows that the training loss is given minimum by the 3 x 3 filter and the highest by the 
7 X 7 filter. With the validation loss, it looks a bit different, but not too much. There are many spikes, but if 
we see the generality, the loss decreases as we use the smaller filter size and increases if we use the larger 
filter size. The second experiment is done on the FashionMNIST dataset with the same network configuration 
as done on the CIFARIO dataset. Thirty-two filters with three different filter sizes are used in both the 
Conv2D layers, and the training, validation, and test accuracy and loss are measured. 


Loss and Validation Loss based on Filter Sizes 
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Figure 3. Training loss and validation loss using different filter sizes on the CIFAR10 dataset 


Table 3 shows, the accuracy of the training dataset, validation dataset, and test dataset is maximum, 
i.e., 92.68%, 92.35, and 92.68%, respectively, when 3 xX 3 filters are used and lowest, i.e., 91.1% when 7 x 7 
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filters are used. Table 4 shows the loss of the training dataset, validation dataset, and test dataset is maximum 
for 3 x 3 filters while in the lowest a 7 X 7 filters are used. Accuracy graphs for FashionMNIST datset is 
plotted in Figure 4. The 5X5 and 3 X3 curves are almost the same on the training data, and their 
performances are the same, but the result is seen in the validation and test datasets where the 3 x 3 filter 
outperforms the other filters. 


Table 3. Accuracy of FashionMNIST dataset using Table 4. Loss of FashionMNIST dataset using 
different filter sizes different filter sizes 
Filter size Training data‘ Validation data _‘Test data Filter size Training data _ Validation data Test data 
3x3 0.929 0.9235 0.9268 3x3 0.185 0.217 0.200 
5x5 0.926 0.9196 0.9264 5x5 0.189 0.228 0.202 
7X7 0.918 0.910 0.911 7X7 0.216 0.256 0.243 


The loss curve for FashionMNIST dataset using different filter sizes is plotted in Figure 5. In both, 
the graphs loss is minimum when the 3 x 3 filter is used and maximum when 7 X 7 is used. 
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Figure 4. Training accuracy and validation accuracy using different filter sizes on the FashionMNIST dataset 
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Figure 5. Training loss and validation loss using different filter sizes on the FashionMNIST dataset 
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5. CONCLUSION 

In this paper, different CNN structures are used for image classification. The experiments have been 
applied on CFIR10 and FashionMNIST datasets, and we used three different filters size, and the number of 
filters is kept constant 32. From the experiment results, filter size has a major effect on classification 
accuracy, the accuracy of training, and validation. Accuracy with filter size 3x3 is higher than 5x5 and 7x7, 
but the training time is high when a smaller filter size is used. Besides, when filter size 3x3 pixels, testing 
accuracy is better than filter size 5x5 pixels and 7x7 pixels. The accuracy shown by using a 3x3 filter size in 
both the datasets is 92.68% in FashionMNIST and 73.04% in the CIFAR1O dataset. The only trade of using 
less filter size is the computational cost associated with the model. 
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